|   | youtube | facebook | newspaper | sales |
|---|---|---|---|---|
| 1 | 276.12 | 45.36 | 83.04 | 26.52 |
| 2 | 53.40 | 47.16 | 54.12 | 12.48 |
| 3 | 20.64 | 55.08 | 83.16 | 11.16 |
| 4 | 181.80 | 49.56 | 70.20 | 22.20 |
| 5 | 216.96 | 12.96 | 70.08 | 15.48 |
Exploratory Data Analysis & Unsupervised Learning
Motivation
Simple Linear Regression
Multiple Linear Regression
Model Evaluation
Model refinement: Regularization
🧐 How would you represent everything in one graph?

“Where there is data smoke, there is business fire.” — Thomas Redman
\[{\cal D}=\begin{bmatrix} X_1 & \dots & X_d & Y\\ x_{11} & \dots & x_{1d} & y_1\\ x_{21} & \dots & x_{2d} & y_2\\ \vdots & \ddots & \vdots & \vdots\\ x_{n1} & \dots & x_{nd} & y_n \end{bmatrix}\]
Objective
Using input \(\text{x}\) to predict its corresponding target \(y\).
Residual Sum of Squares: \(\begin{align*}
\text{RSS}&=\sum_{i=1}^n(\color{red}{y_i-\hat{y}_i})^2\\
&=\sum_{i=1}^n(\color{red}{y_i-\beta_0-\beta_1x_i})^2
\end{align*}\)
Ordinary Least Squares (OLS): the best-fitted line minimizes RSS.
Optimal Least-Square Line
Optimal line: \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1\text{x}\) where
\[\begin{align} \hat{\beta}_1&=\frac{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)}{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2}=\frac{\text{Cov}(X,Y)}{\text{V}(X)}\\ \hat{\beta}_0&=\overline{y}_n-\hat{\beta}_1\overline{\text{x}}_n\end{align}\] where \(\overline{\text{x}}_n\) and \(\overline{y}_n\) denote the sample means.
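As a sanity check, the closed-form estimates can be computed directly with NumPy. The data below are simulated (not the marketing dataset), and `np.polyfit` serves as an independent reference since it minimizes the same RSS:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)               # simulated predictor
y = 8.0 + 0.05 * x + rng.normal(0, 1, 50)      # simulated target

# Closed-form OLS estimates from the formulas above
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# np.polyfit with degree 1 minimizes the same RSS, so both estimates agree
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(np.allclose([beta1, beta0], [b1_np, b0_np]))  # True
```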
from sklearn.linear_model import LinearRegression # import model
lr = LinearRegression() # initialize model
x_train, y_train = df1[['youtube']], df1['sales'] # training input-target
lr = lr.fit(x_train, y_train) # build model = estimate coefficients
# Training data and fitted line
pred_train = lr.predict(x_train)
# Figures
fig_market2 = go.Figure(go.Scatter(x=x_train.youtube, y=y_train, mode="markers", name="Training data"))
fig_market2.add_trace(go.Scatter(x=x_train.youtube, y=pred_train, mode="lines+markers", name=f"<br>Train prediction<br> Sale={np.round(lr.coef_,2)[0]}youtube+{np.round(lr.intercept_,2)}"))
fig_market2.update_layout(title="Sales vs youtube",
xaxis=dict(title="youtube"),
yaxis=dict(title="sales"),
width=600, height=400)
fig_market2.show()
The predictor (youtube) can capture around 61% of the variation of the target (sales).
res = pred_train - y_train # Compute residuals
from plotly.subplots import make_subplots
fig_res = make_subplots(rows=1, cols=2, subplot_titles=("Residuals vs predicted sales", "Residual density"))
fig_res.add_trace(
go.Scatter(x=pred_train, y=res, name="Residuals", mode="markers"),
row=1, col=1)
fig_res.add_trace(
go.Scatter(x=[np.min(pred_train), np.max(pred_train)], y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"),
row=1, col=1)
fig_res.update_xaxes(title_text="Predicted Sales", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)
fig_res.add_trace(
go.Histogram(x=res, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)
fig_res.update_layout(width=950, height=250)
fig_res.show()
Summary
Sales = 0.048 YouTube + 8.439. Sales is expected to increase (or decrease) by 0.048 units for every \(1\) unit increase (or decrease) in YouTube ads.
Pearson’s correlation coefficient
Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link.
|   | youtube | facebook | newspaper | sales |
|---|---|---|---|---|
| youtube | 1.000000 | 0.054809 | 0.056648 | 0.782224 |
| facebook | 0.054809 | 1.000000 | 0.354104 | 0.576223 |
| newspaper | 0.056648 | 0.354104 | 1.000000 | 0.228299 |
| sales | 0.782224 | 0.576223 | 0.228299 | 1.000000 |
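The table above can be reproduced with pandas' `corr()`. A sketch, assuming `df1` holds the marketing data; with only the five preview rows shown here, the numbers will differ from the full-data table, but the call is identical on the full dataset:

```python
import pandas as pd

# Hypothetical stand-in built from the five preview rows of the marketing data
df1 = pd.DataFrame({
    "youtube":   [276.12, 53.40, 20.64, 181.80, 216.96],
    "facebook":  [45.36, 47.16, 55.08, 49.56, 12.96],
    "newspaper": [83.04, 54.12, 83.16, 70.20, 70.08],
    "sales":     [26.52, 12.48, 11.16, 22.20, 15.48],
})
# Pairwise Pearson correlations; the diagonal is 1 by definition
print(df1.corr().round(2))
```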
YouTube is strongly correlated with the target Sales and is most useful for building models, followed by Facebook and Newspaper.
Facebook and Newspaper have a significantly larger correlation with each other than with YouTube.
Normal Equation: \[\color{blue}{\hat{\beta}=(X^tX)^{-1}X^tY}\in\mathbb{R}^{d+1},\text{ with } d=2.\]
The prediction \(\hat{Y}\) of \(Y\) by MLR is the projection of the target \(Y\) onto the subspace spanned by the columns of \(X\).
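The normal equation can be verified against scikit-learn directly. A sketch on simulated data (in practice, prefer `np.linalg.lstsq` over an explicit inverse for numerical stability):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 60, size=(100, 2))            # two simulated predictors
y = 11.0 + 0.2 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 1, 100)

X1 = np.column_stack([np.ones(len(X)), X])       # prepend an intercept column
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)  # (X^t X)^{-1} X^t Y

# scikit-learn solves the same least-squares problem
mlr = LinearRegression().fit(X, y)
print(np.allclose(beta_hat, [mlr.intercept_, *mlr.coef_]))  # True
```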
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
x_train, y_train = df1[["facebook", "newspaper"]], df1.sales
mlr.fit(x_train, y_train) # Fit model
y_hat = mlr.predict(x_train) # Predict
x_surf0 = np.array([[np.min(x_train.facebook), np.min(x_train.newspaper)],
[np.max(x_train.facebook), np.min(x_train.newspaper)],
[np.min(x_train.facebook), np.max(x_train.newspaper)],
[np.max(x_train.facebook), np.max(x_train.newspaper)]])
y_surf = mlr.predict(x_surf0).reshape(2,2)
x_surf = np.array([[np.min(x_train.facebook), np.min(x_train.newspaper)],
[np.max(x_train.facebook), np.max(x_train.newspaper)]])
frames = []
frames.append(
go.Frame(
data=[go.Scatter3d(x=x_train.facebook,
y=x_train.newspaper,
z=y_train,
mode="markers",
marker=dict(size=5),
hovertemplate='facebook: %{x}<br>' +
'newspaper: %{y}<br>' +
'sales: %{z}<extra></extra>'),
go.Surface(x=x_surf[:,0],
y=x_surf[:,1],
z=y_surf,
hovertemplate='facebook: %{x}<br>' +
'newspaper: %{y}<br>' +
'sales: %{z}<extra></extra>')],
name=f"Level: 0"
))
alpha = np.linspace(0,1,10)
for alp in alpha[1:]:
    y_proj = y_train + alp * (y_hat - y_train)
    frames.append(go.Frame(
        data=[go.Scatter3d(x=x_train.facebook,
                           y=x_train.newspaper,
                           z=y_proj,
                           mode="markers",
                           marker=dict(size=5),
                           hovertemplate='facebook: %{x}<br>' +
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>'),
              go.Surface(x=x_surf[:,0],
                         y=x_surf[:,1],
                         z=y_surf,
                         hovertemplate='facebook: %{x}<br>' +
                                       'newspaper: %{y}<br>' +
                                       'sales: %{z}<extra></extra>')],
        name=f"Level: {np.round(alp,3)}"
    ))
# Add scatter plot and first polynomial fit to the initial figure
fig_mlr = go.Figure(
data=[
go.Scatter3d(x=x_train.facebook,
y=x_train.newspaper,
z=y_train,
mode="markers",
marker=dict(size=5),
hovertemplate='facebook: %{x}<br>' +
'newspaper: %{y}<br>' +
'sales: %{z}<extra></extra>'),
go.Surface(x=x_surf[:,0],
y=x_surf[:,1],
z=y_surf,
hovertemplate='facebook: %{x}<br>' +
'newspaper: %{y}<br>' +
'sales: %{z}<extra></extra>')
],
layout=go.Layout(
title=f"Sales = {np.round(mlr.intercept_,3)}+{np.round(mlr.coef_[0],3)}Facebook+{np.round(mlr.coef_[1],3)}Newspaper",
xaxis=dict(title="Facebook", range=[np.min(df1["facebook"])*0.9, np.max(df1["facebook"])*1.1]),
yaxis=dict(title="Newspaper", range=[np.min(df1["newspaper"])*0.9, np.max(df1["newspaper"])*1.1]),
updatemenus=[{
"buttons": [
{
"args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
"label": "Play",
"method": "animate"
},
{
"args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
"label": "Stop",
"method": "animate"
}
],
"type": "buttons",
"showactive": False,
"x": 0.1,
"y": 1.2,
"pad": {"r": 3, "t": 50}
}],
sliders=[{
"active": 0,
"currentvalue": {"prefix": "Project "},
"pad": {"t": 50},
"steps": [{"label": f"Level: {np.round(alp,3)}",
"method": "animate",
"args": [[f"Level: {np.round(alp,3)}"], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate",
"transition": {"duration": 10}}]}
for alp in alpha]
}]
),
frames=frames
)
fig_mlr.update_layout(height=450, width=800, scene_camera=camera,
scene=dict(
zaxis=dict(title="Sales", range=[np.min(df1.sales)*0.9, np.max(df1.sales)*1.1])
))
fig_mlr.update_scenes(xaxis_title_text= "Facebook",
yaxis_title_text= "Newspaper",
zaxis_title_text="Sales"
)
fig_mlr.show()
resid = y_train - y_hat # residuals
from plotly.subplots import make_subplots
fig_res = make_subplots(rows=1, cols=2, subplot_titles=("Residuals vs predicted sales", "Residual density"))
fig_res.add_trace(
go.Scatter(x=y_hat, y=resid, name="Residuals", mode="markers"),
row=1, col=1)
fig_res.add_trace(
go.Scatter(x=[np.min(y_hat), np.max(y_hat)], y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"),
row=1, col=1)
fig_res.update_xaxes(title_text="Predicted Sales", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)
fig_res.add_trace(
go.Histogram(x=resid, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)
fig_res.update_layout(width=950, height=350)
fig_res.show()
Summary
Sales = 11.027 + 0.199 Facebook + 0.007 Newspaper.
If facebook ad spend is increased (or decreased) by \(1\) unit, sales is expected to increase (or decrease) by 0.199 units.
The variation of sales captured by Facebook and Newspaper together is low, which is not enough to be a good model!
Box-Cox transformation: \(y_i\to \tilde{y}_i=\begin{cases}\frac{y_i^{\lambda}-1}{\lambda}&\text{if }\lambda\neq 0\\ \log(y_i)&\text{if }\lambda=0.\end{cases}\)
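The Box-Cox transformation is available in SciPy, which can also estimate \(\lambda\) by maximum likelihood. A sketch on simulated right-skewed data (for lognormal data the estimated \(\lambda\) should be near 0, i.e., close to a log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=2.0, sigma=0.6, size=500)  # right-skewed, positive target

# boxcox estimates lambda by maximum likelihood and returns transformed values
y_tilde, lam = stats.boxcox(y)
print(round(lam, 2))
print(stats.skew(y) > stats.skew(y_tilde))  # skewness is reduced
```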
The transformation can improve the normality of the residuals.
A good model must not only perform well on the training data (used to build it), but also on unseen observations.
We should judge a model based on how it generalizes to new unseen observations.
Training data is used for building the model; testing data is used for testing the model.
import rdata
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.metrics import mean_squared_error
market = rdata.read_rda(path1)
market = market['marketing']
shuffle_id = np.random.choice(['train', 'test'],
replace=True,
p=[0.75, 0.25],
size=market.shape[0])
market['type'] = shuffle_id
# Model
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression().fit(market.loc[market.type == "train", ['youtube']], market.loc[market.type == "train", "sales"])
y_hat = lr1.predict(market.loc[market.type == "test", ['youtube']])
import plotly.express as px
import plotly.graph_objects as go
fig1 = px.scatter(data_frame=market,
x="youtube",
y="sales",
color="type",
color_discrete_map={
"train": "#e89927",
"test": "#3bbc35"
})
fig1.add_trace(go.Scatter(x=market.loc[market.type == "test", 'youtube'],
y=y_hat,
name="Model built on train data",
line=dict(color="#e89927")))
fig1.update_layout(width=600, height=250, title="SLR Model: Sales vs Youtube")
fig1.show()
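The imported `mean_squared_error` can then score the model on the held-out part. A self-contained sketch of the same idea on simulated data (variable names and numbers are illustrative, and `train_test_split` replaces the manual shuffle above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(0, 350, size=(200, 1))            # simulated 'youtube' spend
y = 8.4 + 0.05 * X[:, 0] + rng.normal(0, 2, 200)  # simulated 'sales'

# Hold out 25% of the data, mirroring the 0.75/0.25 split above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Test MSE estimates the error on unseen data; RMSE is on the target's scale
mse = mean_squared_error(y_te, model.predict(X_te))
print(round(np.sqrt(mse), 2))
```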

Estimate the MSE of New Unseen Data\(^{\text{📚}}\)
Metrics other than MSE, such as Accuracy, F1-score, … can be used in the same way.
This estimate is called the Test Error.
In the sales vs youtube example: 5-fold CV-MSE = 22.243, i.e., CV-RMSE = 4.716.
from gapminder import gapminder
import numpy as np
from sklearn.preprocessing import OneHotEncoder as onehot
encoder = onehot()
encoded_data = encoder.fit_transform(gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()
# encoded dataset
X_encoded = pd.DataFrame(encoded_data, columns=[x.replace('continent_', '') for x in encoder.get_feature_names_out(['continent'])])
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lr = LinearRegression()
lr.fit(X_encoded, gapminder.lifeExp.loc[gapminder.year==2007])
R2 = r2_score(gapminder.lifeExp.loc[gapminder.year==2007], lr.predict(X_encoded))
df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year==2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year==2007,:],
x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()
continent is useful for predicting lifeExp.
| Africa | Americas | Asia | Europe | Oceania | lifeExp |
|---|---|---|---|---|---|
| 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 43.828000 |
sales vs youtube: \(R^2\approx 61\%\).
Tuning Polynomial Degree Using \(K\)-fold Cross-Validation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score
# Data
X, y = market[["youtube"]], market['sales']
# List of all degrees to search over
degree = list(range(1,11))
# List to store all losses
loss = []
for deg in degree:
    pf = PolynomialFeatures(degree=deg)
    X_poly = pf.fit_transform(X)
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
Idea: control the magnitude of the coefficients.
Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),
Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\]
Recall: SLR & MLR seek to minimize only RSS.
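The effect of the penalty can be seen by fitting `Ridge` with increasing \(\alpha\): the coefficients shrink toward zero. A minimal sketch on simulated data (note that scikit-learn's `Ridge` leaves the intercept unpenalized, unlike the \(j=0\) term in the formula above):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1, 100)

# Larger alpha puts more weight on the magnitude term, shrinking coefficients
for alpha in [0.01, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(np.abs(ridge.coef_).sum(), 3))
```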
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Data
X, y = market[["youtube"]], market['sales']
poly = PolynomialFeatures(degree=8)
X_poly = poly.fit_transform(X)
# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 20000, 30))
# List to store all losses
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Fit
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
Pros
Cons
Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\]
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
Pros
Performs feature selection as the regularization parameter \(\alpha\) increases (less important coefficients are forced to be completely \(0\)).
Cons

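The lasso's feature-selection behavior can be demonstrated directly: as \(\alpha\) grows, `Lasso` drives some coefficients to exactly zero. A sketch on simulated data where only the first feature is informative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + rng.normal(0, 1, 200)  # only feature 0 is informative

# Increasing alpha zeroes out the uninformative coefficients first
for alpha in [0.01, 0.5, 2.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0.0))
    print(alpha, n_zero)
```

Unlike ridge, which only shrinks coefficients toward zero, the lasso's \(|\beta_j|\) penalty sets them exactly to zero, which is why it doubles as a variable-selection method.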